32 research outputs found

    Building HyperView web sites

    In this report, a framework for building “virtual” web sites using the HyperView system is presented. Virtual web sites are web sites that offer information extracted and integrated from other web sites on the fly. The HyperView system already supports the demand-driven integration of information from different semistructured information sources into a graph database. The problem we are dealing with here is to query the database and generate HTML pages from the results as a response to HTTP requests received from the user. The returned HTML pages should hide the aspects of data extraction and integration and should give the user the impression of a single, coherent web site. We first show how HyperViews comprising graph-transformation rules can be defined that generate HTML pages from the database. This way, web sites for individual application schemata can be designed. In the second part, we present a generic rule set that defines a web interface for HyperView graph databases with arbitrary schemata. This generic web interface can be customized for the particular application by annotating the database schema and choosing appropriate styles. The work presented in this report completes the HyperView approach in the sense that it closes the circle of extracting and integrating information from the web by again publishing the integrated data on the web. Our approach applies as well to the integration and generation of XML documents on the web.

    Concept Embedding for Relevance Detection of Search Queries Regarding CHOP.

    Automatic encoding of diagnoses and procedures can increase the interoperability and efficacy of clinical cooperation. Concept-based, rule-based, and machine learning classification methods for automatic code generation can easily reach their limits because of handcrafted rules and the limited vocabulary coverage of a concept library. As a first step towards applying deep learning methods to automatic encoding in the clinical domain, a suitable semantic representation should be generated. In this work, we focus on the embedding mechanism and dimensionality reduction method for text representation, which mitigate the sparseness of the input data in the clinical domain. Different methods, such as word embedding and random projection, will be evaluated based on logs of query-document matching.

    litsift: Automated Text Categorization in Bibliographic Search

    In bioinformatics there exist research topics that cannot be uniquely characterized by a set of key words because relevant key words are (i) also heavily used in other contexts and (ii) often omitted in relevant documents because the context is clear to the target audience. Information retrieval interfaces such as Entrez/PubMed produce either low precision or low recall in this case. To yield a high recall at a reasonable precision, the results of a broad information retrieval search have to be filtered to remove irrelevant documents. We use automated text categorization for this purpose. In this study we use the topic of conserved secondary RNA structures in viral genomes as a running example. PubMed result sets for two virus groups, Picornaviridae and Flaviviridae, have been manually labeled by human experts. We evaluated various classifiers from the Weka toolkit together with different feature selection methods to assess whether classifiers trained on documents dedicated to one virus group can be successfully applied to filter literature on other virus groups. Our results indicate that in this domain a bibliographic search tool trained on a reference corpus may significantly reduce the amount of time needed for extensive literature searches.

    Systematic feature evaluation for gene name recognition

    In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine (SVM), combined with pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination (RFE). Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7% compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.

    Efficient Multi-Profile Filtering using Finite Automata

    The task of an alerting system is to observe events, produce a strea

    The Formal Framework of the HyperView System

    In this report, we introduce the graph rewriting formalism on which the HyperView System is based. We first present a data model for clustered graphs and our notion of graph schemata and graph layers. Then we formalize our concept of nondeleting typed graph rewriting rules with application conditions on attributes based on the Algebraic Single Push Out Approach to graph transformation and present the construction of the derived graph resulting from applying a rule. The main contribution of this report is the formalization of an efficient strategy for materializing HyperViews based on demand-driven rule activation. We introduce the notion of an oracle against which queries in form of graph patterns can be posed. We show how to combine the rule set of a HyperView with an oracle to form a more powerful oracle that materializes this HyperView as a response to queries against it. Finally we treat the problem of avoiding the introduction of redundancies in view graphs by reusing already m..

    Der HyperView-Ansatz zur Integration semistrukturierter Daten

    Title page, contents
    1 Introduction 1.1 Integration of semistructured information sources Virtual Web Sites 1.2 The HyperView approach 1.2.1 Data Model and View Mechanism 1.2.2 Architecture 1.2.3 Application of the HyperView Technology 1.3 Related Work (Overview) 1.4 Overview
    2 HyperView by Example: Wrapping Publisher Web Sites 2.1 Digital Libraries of Electronic Journals 2.1.1 The DARWIN project 2.1.2 Use cases 2.2 Modeling publisher Web Sites 2.2.1 Generic approach 2.2.2 Graph Schemata 2.2.3 The HyperView Database Schema 2.2.4 ACR Schemata of Example Web Sources 2.2.5 Representing HTML Pages as HTML graphs 2.3 Building Views on publisher Web Sites 2.3.1 Queries and Rules 2.3.2 Defining a View over the HTML Graphs 2.3.3 Defining a View over the ACR Graphs 2.3.4 Querying the HyperView system 2.4 The Architecture of DARWIN 2.5 Summary
    3 Formal Framework 3.1 Clustered Graph Data Model (CGDM) 3.1.1 Motivation 3.1.2 Basic definitions 3.1.3 Schemata and instances 3.2 Rules 3.2.1 Rule application 3.3 Queries and Oracles 3.3.1 Applying a rule to a virtual data graph 3.3.2 Hyperviews 3.3.3 Using a rule to answer a subquery 3.3.4 Chaining rules to answer a query 3.4 Reuse of existing subgraphs 3.5 Bibliography on Graph-Transformation 3.6 Summary
    4 The HyperView System 4.1 Encoding of Graphs 4.1.1 Plain Graphs 4.1.2 Clustered Graphs 4.1.3 Type checking 4.2 Encoding of Queries 4.3 Encoding of Rules 4.4 Rule Activation 4.5 Query execution 4.6 Complexity and Performance 4.7 Metadata management 4.7.1 Schema clusters 4.7.2 The `meta` cluster 4.7.3 WWW meta data 4.8 The HyperView System prototype 4.9 Summary
    5 The HVQL Query Language 5.1 Introduction 5.2 Basic Notations 5.3 Graph Patterns 5.4 Graph Literals 5.5 Queries 5.5.1 Syntax 5.5.2 Semantics 5.5.3 Implementation 5.6 Rules 5.6.1 Syntax 5.6.2 Semantics 5.6.3 Implementation 5.6.4 Example 5.7 Meta Edges 5.8 HTML Edges 5.9 Embedding of HVQL in the HyperView System 5.10 Summary
    6 Support for Web Interfaces 6.1 Introduction 6.2 Architecture of the HyperView Web server 6.3 Conceptual model of the virtual HyperView Web site 6.4 HTML Code Generation 6.4.1 Phase 1: Preparation 6.4.2 Phase 2: Generation of a HTML skeleton 6.4.3 Phase 3: HTML dump and generation of variable HTML code 6.4.4 HVQL notation for HTML rules 6.5 The HyperView Browser 6.5.1 Customization 6.6 Summary
    7 Case Study: Town Information 7.1 Introduction 7.2 Scenario 7.2.1 Use Case 7.3 Developing a cultural event calendar 7.3.1 Conceptual schema 7.3.2 Wrapping town information sites 7.4 The cultural calendar Web site 7.5 Summary
    8 The HyperView Methodology 8.1 User roles 8.2 Content Specification 8.3 The Design Space of HyperView 8.4 Schema development 8.4.1 HTML layer 8.4.2 ACR layer 8.4.3 Database layer 8.4.4 UI layer 8.5 View development 8.5.1 Implementing HTML views 8.5.2 ACR Views 8.5.3 DB Views 8.6 Maintenance 8.6.1 Robustness 8.6.2 Error detection 8.6.3 Adaption 8.7 Summary
    9 Discussion and Outlook 9.1 Related Work 9.1.1 Data models and schemata for semistructured data 9.1.2 Data Extraction from Semistructured Documents 9.1.3 Querying the Web 9.1.4 Integration of Heterogeneous Data Sources 9.1.5 Related applications of Graph-Transformation techniques 9.1.6 Comparison with HyperView 9.2 Future Applications: XML & RDF 9.2.1 XML 9.2.2 XML Parsing 9.2.3 XML DTDs and schemata 9.2.4 XPointer and XQL 9.2.5 Extensible Stylesheet Language 9.2.6 Channel Definition Format 9.2.7 Resource Description Framework (RDF) 9.2.8 RDF Schemata 9.2.9 Summary 9.3 Open Issues 9.3.1 Theoretical Issues 9.3.2 Integration Issues 9.3.3 Implementation and Performance Issues 9.3.4 Interface Issues 9.4 Contributions and Outlook 9.5 Acknowledgments
    Bibliography, Table of Mathematical Symbols, Summary of Results, Curriculum Vitae, Resources Used

    Using the World Wide Web to answer a specific question often requires information to be collected from multiple heterogeneous Web sites. 
Virtual Web sites are a promising approach to automate this task for particular, focused application domains. A virtual Web site serves pages containing concentrated information that has been extracted, homogenized, and combined from several underlying Web sites. The HyperView approach to the integration of semistructured data presented in this thesis provides a methodology, a formal framework, and a software environment for building such virtual Web sites. The HyperView approach treats the three steps of data extraction, integration, and presentation uniformly as consecutive views that map between different levels of abstraction. These levels are reflected by the architectural layers of the system. The contents of Web sites as well as the consecutive views are represented as graphs. Views are defined by sets of graph transformation rules. A demand-driven rule activation mechanism has been formally described and implemented. This mechanism incrementally materializes views in response to queries issued against them. The HyperView System has been implemented in Prolog. Graph transformation rules are compiled into efficient Prolog predicates. Java servlets are used to support virtual Web sites. The main contributions of this thesis are: (1) the key idea of applying the same view mechanism uniformly to the problems of extraction, integration, and presentation, (2) the HyperView methodology for modeling and integrating Web sites, (3) the formal framework defining the data model, rule concept, and the demand-driven view materialization mechanism of HyperView, (4) the HyperView System prototype providing a platform for building virtual integrated Web sites, and (5) the validation of the HyperView methodology and system in case studies on Digital Libraries and Town Information.

    Building HyperView wrappers for publisher web-sites

    1 Introduction: The number of electronic journals is rapidly increasing. Publishers currently offer several thousand electronic journals. For researchers, the advent of online journal editions has made life much easier, since all information is reachable immediately from one's own desktop. On the other hand, this huge amount of heterogeneously structured information makes conventional methods for dealing with it, such as bookmarking and browsing of publisher Web sites, inadequate.

    Storing and Querying Historical Texts in a Relational Database

    This paper describes an approach for storing and querying a large corpus of linguistically annotated historical texts in a relational database management system. Texts in such a corpus have a complex structure consisting of multiple text layers that are richly annotated and aligned to each other. Modeling and managing such corpora poses various challenges not present in simpler text collections. In particular, it is a difficult task to design and efficiently implement a query language for such complex annotation structures that fulfills the requirements of linguists and philologists. In this report, we describe steps towards a solution of this task. We describe a model for storing arbitrarily complex linguistic annotation schemes for text. The text itself may be present in various transliterations, transcriptions, or editions. We identify the main requirements for a query language on linguistic annotations in this scenario. From these requirements, we derive fundamental query operators and sketch their implementation in our model. Furthermore, we discuss initial ideas for improving the efficiency of an implementation based on relational databases and XML techniques.